Which companies use R


[company logo] For advertising effectiveness and economic forecasting.
[company logo] Acquired the Revolution R company and uses it for various purposes.
[company logo] For behavior analysis related to status updates and profile pictures.
[company logo] For data visualization and semantic clustering.
[company logo] For statistical analysis.
[company logo] To scale data science.
[company logo] For data curation, analysis and visualisation.
And many more…

Why use R? Even Python can do all this stuff

Think of R like a cat!
And Python like a dog!
Both are great pets to have. Some people prefer one over the other, but at the end of the day both are amazing.
The problem starts when someone looks at R and expects it to be a dog:
“Your dog is broken!”
R has some strange parts, but it compensates with some great parts. They are not just good, but great.
Some parts of R are better than Python, and some parts of Python are better than R.

Acquiring the data

Data can be acquired from many sources into R. R supports file formats like csv, xlsx, SPSS and SAS, as well as databases like MySQL, SQLite, PostgreSQL, MonetDB, etc.

The most common methods are reading data from a csv, xlsx or txt file, or connecting to a MySQL or SQLite database.
[package logo] Used for reading rectangular data into R, such as “csv”, “tsv”, and “fwf”
[package logo] Used to import Excel files into R
[package logo] R interface to Apache Spark for working with big data
[package logo] Manage Google Drive files from R.
[package logo] Interact with Google Sheets from R.
[package logo] A wrapper around the ‘xml2’ and ‘httr’ packages that makes it easy to download and manipulate HTML and XML

Reading local data

We can read a .csv file using the base read.csv() function or the read_csv() function from the readr package

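# Note: read.csv() treats the first line of the file as column names (header = TRUE by default).
# stringsAsFactors = TRUE keeps text columns as factors, which was the default before R 4.0.
data <- read.csv("datasets/adult_data.csv", stringsAsFactors = TRUE)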
names(data) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", "occupation", "relationship", "race", "gender", "capital_gain", "capital_loss", "hours_per_week", "native_country", "predictive_variable")
head(data)
##   age         workclass fnlwgt  education education_num
## 1  50  Self-emp-not-inc  83311  Bachelors            13
## 2  38           Private 215646    HS-grad             9
## 3  53           Private 234721       11th             7
## 4  28           Private 338409  Bachelors            13
## 5  37           Private 284582    Masters            14
## 6  49           Private 160187        9th             5
##           marital_status         occupation   relationship   race  gender
## 1     Married-civ-spouse    Exec-managerial        Husband  White    Male
## 2               Divorced  Handlers-cleaners  Not-in-family  White    Male
## 3     Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
## 4     Married-civ-spouse     Prof-specialty           Wife  Black  Female
## 5     Married-civ-spouse    Exec-managerial           Wife  White  Female
## 6  Married-spouse-absent      Other-service  Not-in-family  Black  Female
##   capital_gain capital_loss hours_per_week native_country
## 1            0            0             13  United-States
## 2            0            0             40  United-States
## 3            0            0             40  United-States
## 4            0            0             40           Cuba
## 5            0            0             40  United-States
## 6            0            0             16        Jamaica
##   predictive_variable
## 1               <=50K
## 2               <=50K
## 3               <=50K
## 4               <=50K
## 5               <=50K
## 6               <=50K
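
When a file has no header row and pads each value with a space after the comma, read.csv() can deal with both while loading. The following is a minimal sketch, assuming a small temporary file whose two rows are copied from the output above; the temporary file, the reduced set of columns and the name `sample_data` are only for illustration.

```r
# Write a tiny header-less sample file (values taken from the first two rows shown above)
sample_path <- tempfile(fileext = ".csv")
writeLines(
  c("50, Self-emp-not-inc, 83311, <=50K",
    "38, Private, 215646, <=50K"),
  sample_path
)

sample_data <- read.csv(
  sample_path,
  header = FALSE,            # the first line is data, not column names
  col.names = c("age", "workclass", "fnlwgt", "predictive_variable"),
  strip.white = TRUE,        # remove the blank that follows each comma
  stringsAsFactors = TRUE    # keep text columns as factors
)
str(sample_data)
```

With strip.white = TRUE the factor levels come out without the leading space, so later comparisons such as `== "<=50K"` behave as expected.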

In order to obtain data from a database like SQLite, we first need to establish a connection to the database

library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

Then we can use this connection object to access and edit the database

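# An in-memory SQLite database starts empty, so load two built-in datasets first
dbWriteTable(con, "iris", iris)
dbWriteTable(con, "mtcars", mtcars)
dbListTables(con)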
## [1] "iris"   "mtcars"
mtcarsData <- dbReadTable(con, "mtcars")
str(mtcarsData)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
dbDisconnect(con)
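
Besides reading a whole table, the connection can also run SQL so that the database does the filtering or editing. The following is a rough sketch, assuming the DBI and RSQLite packages are installed and using an in-memory database filled with the built-in mtcars data; the queries are only illustrative.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# Let the database filter the rows and return only the columns we ask for
four_cyl <- dbGetQuery(con, "SELECT mpg, cyl, hp FROM mtcars WHERE cyl = 4")
nrow(four_cyl)

# Edit the table, then count what is left
dbExecute(con, "DELETE FROM mtcars WHERE cyl = 8")
remaining <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM mtcars")$n
remaining

dbDisconnect(con)
```

dbGetQuery() is meant for statements that return rows, while dbExecute() is meant for statements that change data.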

Cleaning the data

In the real world the data is not always “clean”. There are many ways to define clean. Common things to look out for:


[package logo] dplyr is one of the most used packages for data wrangling in R
[package logo] Also a very popular package for data wrangling
[package logo] Used for string manipulation
[package logo] Used to work with date data
[package logo] Used to work with time data
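
One common cleaning step is turning placeholder values into real missing values. The adult data used in this document marks unknown values with " ?". The following is a minimal base R sketch on a few made-up rows; the data frame `raw` and its values are hypothetical.

```r
# Hypothetical rows; the " ?" marker with its leading space mirrors the adult data
raw <- data.frame(
  workclass = c(" Private", " ?", " State-gov", " ?"),
  native_country = c(" United-States", " Cuba", " ?", " Jamaica"),
  stringsAsFactors = FALSE
)

clean <- raw
for (column in names(clean)) {
  values <- trimws(clean[[column]])   # drop leading/trailing whitespace
  values[values == "?"] <- NA         # turn the placeholder into a real missing value
  clean[[column]] <- values
}

# Number of missing values in each column
missing_per_column <- colSums(is.na(clean))
missing_per_column
```

Once the placeholders are NA, functions such as is.na(), summary() and na.omit() treat them as missing rather than as a category of their own.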

Understanding the data

str(data)
## 'data.frame':    32560 obs. of  15 variables:
##  $ age                : int  50 38 53 28 37 49 52 31 42 37 ...
##  $ workclass          : Factor w/ 9 levels " ?"," Federal-gov",..: 7 5 5 5 5 5 7 5 5 5 ...
##  $ fnlwgt             : int  83311 215646 234721 338409 284582 160187 209642 45781 159449 280464 ...
##  $ education          : Factor w/ 16 levels " 10th"," 11th",..: 10 12 2 10 13 7 12 13 10 16 ...
##  $ education_num      : int  13 9 7 13 14 5 9 14 13 10 ...
##  $ marital_status     : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 3 1 3 3 3 4 3 5 3 3 ...
##  $ occupation         : Factor w/ 15 levels " ?"," Adm-clerical",..: 5 7 7 11 5 9 5 11 5 5 ...
##  $ relationship       : Factor w/ 6 levels " Husband"," Not-in-family",..: 1 2 1 6 6 2 1 2 1 1 ...
##  $ race               : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 3 3 5 3 5 5 5 3 ...
##  $ gender             : Factor w/ 2 levels " Female"," Male": 2 2 2 1 1 1 2 1 2 2 ...
##  $ capital_gain       : int  0 0 0 0 0 0 0 14084 5178 0 ...
##  $ capital_loss       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours_per_week     : int  13 40 40 40 40 16 45 50 40 80 ...
##  $ native_country     : Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 6 40 24 40 40 40 40 ...
##  $ predictive_variable: Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 2 2 2 2 ...
summary(data)
##       age                    workclass         fnlwgt       
##  Min.   :17.00    Private         :22696   Min.   :  12285  
##  1st Qu.:28.00    Self-emp-not-inc: 2541   1st Qu.: 117832  
##  Median :37.00    Local-gov       : 2093   Median : 178363  
##  Mean   :38.58    ?               : 1836   Mean   : 189782  
##  3rd Qu.:48.00    State-gov       : 1297   3rd Qu.: 237055  
##  Max.   :90.00    Self-emp-inc    : 1116   Max.   :1484705  
##                  (Other)          :  981                    
##          education     education_num                  marital_status 
##   HS-grad     :10501   Min.   : 1.00    Divorced             : 4443  
##   Some-college: 7291   1st Qu.: 9.00    Married-AF-spouse    :   23  
##   Bachelors   : 5354   Median :10.00    Married-civ-spouse   :14976  
##   Masters     : 1723   Mean   :10.08    Married-spouse-absent:  418  
##   Assoc-voc   : 1382   3rd Qu.:12.00    Never-married        :10682  
##   11th        : 1175   Max.   :16.00    Separated            : 1025  
##  (Other)      : 5134                    Widowed              :  993  
##             occupation            relationship  
##   Prof-specialty :4140    Husband       :13193  
##   Craft-repair   :4099    Not-in-family : 8304  
##   Exec-managerial:4066    Other-relative:  981  
##   Adm-clerical   :3769    Own-child     : 5068  
##   Sales          :3650    Unmarried     : 3446  
##   Other-service  :3295    Wife          : 1568  
##  (Other)         :9541                          
##                   race           gender       capital_gain  
##   Amer-Indian-Eskimo:  311    Female:10771   Min.   :    0  
##   Asian-Pac-Islander: 1039    Male  :21789   1st Qu.:    0  
##   Black             : 3124                   Median :    0  
##   Other             :  271                   Mean   : 1078  
##   White             :27815                   3rd Qu.:    0  
##                                              Max.   :99999  
##                                                             
##   capital_loss     hours_per_week         native_country 
##  Min.   :   0.00   Min.   : 1.00    United-States:29169  
##  1st Qu.:   0.00   1st Qu.:40.00    Mexico       :  643  
##  Median :   0.00   Median :40.00    ?            :  583  
##  Mean   :  87.31   Mean   :40.44    Philippines  :  198  
##  3rd Qu.:   0.00   3rd Qu.:45.00    Germany      :  137  
##  Max.   :4356.00   Max.   :99.00    Canada       :  121  
##                                    (Other)       : 1709  
##  predictive_variable
##   <=50K:24719       
##   >50K : 7841       
##                     
##                     
##                     
##                     
## 

Understand every column of the data first

Numerical data fields

Age

library(ggplot2)
ggplot(data, aes(x = age)) + geom_bar()

Hours worked per week

ggplot(data, aes(x = hours_per_week)) + geom_histogram(binwidth = 10)

Categorical data fields

Marital Status

library(dplyr)
library(plotly)
data_marital_status <- data %>% group_by(marital_status) %>% summarise(count = n())
ggplotly(ggplot(data_marital_status, aes(x = reorder(marital_status, count), y = count)) + geom_col() + coord_flip())
ggplot(data_marital_status, aes(x = "", y = count, fill = reorder(marital_status, - count)))+
    geom_bar(width = 1, stat = "identity") +
    coord_polar("y", start=0)

Education

ggplotly(ggplot(data %>% group_by(education) %>% summarise(count = n()), aes(x = reorder(education, count), y = count)) + geom_col() + coord_flip())

Occupation

ggplotly(ggplot(data %>% group_by(occupation) %>% summarise(count = n()), aes(x = reorder(occupation, count), y = count)) + geom_col() + coord_flip())

Relationship

ggplotly(ggplot(data %>% group_by(relationship) %>% summarise(count = n()), aes(x = reorder(relationship, count), y = count)) + geom_col() + coord_flip())

Race

ggplotly(ggplot(data %>% group_by(race) %>% summarise(count = n()), aes(x = reorder(race, count), y = count)) + geom_col() + coord_flip())

Gender

ggplotly(ggplot(data %>% group_by(gender) %>% summarise(count = n()), aes(x = reorder(gender, count), y = count)) + geom_col() + coord_flip())

Native Country

ggplotly(ggplot(data %>% group_by(native_country) %>% summarise(count = n()), aes(x = reorder(native_country, count), y = count)) + geom_col() + coord_flip())
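
The categorical plots above all repeat the same pattern, so it could be wrapped in a small function. The following is a sketch, assuming dplyr (1.0 or later) and ggplot2 are installed; `plot_category_counts` is a hypothetical helper name, and the example uses the built-in iris data so that it runs on its own.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: count the values of one categorical column and plot them
plot_category_counts <- function(df, column) {
  counts <- df %>%
    group_by(across(all_of(column))) %>%
    summarise(count = n(), .groups = "drop")

  ggplot(counts, aes(x = reorder(.data[[column]], count), y = count)) +
    geom_col() +
    coord_flip() +
    labs(x = column)
}

# Example with a built-in dataset; wrap the result in ggplotly() for the interactive version
species_plot <- plot_category_counts(iris, "Species")
```

The column name is passed as a string, so the same call could be reused for "education", "occupation" and the other categorical fields.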

Now try to form hypotheses and test them

Hypothesis 1

People who study more make more money

data$is_rich <- if_else(data$predictive_variable == "<=50K", 0, 1)
education_summary <- data %>% group_by(education) %>% summarise(number_of_rich_people = sum(is_rich), number_of_poor_people = n() - number_of_rich_people, total_people = n())
education_summary
## # A tibble: 16 x 4
##    education       number_of_rich_people number_of_poor_people total_people
##    <fct>                           <dbl>                 <dbl>        <int>
##  1 " 10th"                           933                     0          933
##  2 " 11th"                          1175                     0         1175
##  3 " 12th"                           433                     0          433
##  4 " 1st-4th"                        168                     0          168
##  5 " 5th-6th"                        333                     0          333
##  6 " 7th-8th"                        646                     0          646
##  7 " 9th"                            514                     0          514
##  8 " Assoc-acdm"                    1067                     0         1067
##  9 " Assoc-voc"                     1382                     0         1382
## 10 " Bachelors"                     5354                     0         5354
## 11 " Doctorate"                      413                     0          413
## 12 " HS-grad"                      10501                     0        10501
## 13 " Masters"                       1723                     0         1723
## 14 " Preschool"                       51                     0           51
## 15 " Prof-school"                    576                     0          576
## 16 " Some-college"                  7291                     0         7291

Every person is counted as rich, which cannot be right. The comparison with "<=50K" never matches because the values in predictive_variable carry a leading space:

print(unique(data$predictive_variable))
## [1]  <=50K  >50K 
## Levels:  <=50K  >50K
print(unique(as.character(data$predictive_variable)))
## [1] " <=50K" " >50K"
library(stringr)
salary <- unique(as.character(data$predictive_variable))
str_sub(salary, 2, str_length(salary))
## [1] "<=50K" ">50K"
gsub(" ", "", salary)
## [1] "<=50K" ">50K"
trimws(salary)
## [1] "<=50K" ">50K"
salary
## [1] " <=50K" " >50K"
library(microbenchmark)

microbenchmark(str_sub(salary, 2, str_length(salary)), gsub(" ", "", salary), trimws(salary))
## Unit: microseconds
##                                    expr   min     lq    mean median     uq
##  str_sub(salary, 2, str_length(salary))   4.7   7.95  20.499  14.05  28.25
##                   gsub(" ", "", salary)   5.0  11.10  33.421  16.90  26.75
##                          trimws(salary) 153.0 314.80 552.995 442.55 657.55
##     max neval cld
##   141.7   100  a 
##   640.9   100  a 
##  2409.1   100   b
salary <- str_sub(salary, 2, str_length(salary))
salary
## [1] "<=50K" ">50K"
data$predictive_variable <- str_sub(data$predictive_variable, 2, str_length(data$predictive_variable))

data$is_rich <- if_else(data$predictive_variable == "<=50K", 0, 1)
education_summary <- data %>% group_by(education) %>% summarise(number_of_rich_people = sum(is_rich), number_of_poor_people = n() - number_of_rich_people, total_people = n())
education_summary
## # A tibble: 16 x 4
##    education       number_of_rich_people number_of_poor_people total_people
##    <fct>                           <dbl>                 <dbl>        <int>
##  1 " 10th"                            62                   871          933
##  2 " 11th"                            60                  1115         1175
##  3 " 12th"                            33                   400          433
##  4 " 1st-4th"                          6                   162          168
##  5 " 5th-6th"                         16                   317          333
##  6 " 7th-8th"                         40                   606          646
##  7 " 9th"                             27                   487          514
##  8 " Assoc-acdm"                     265                   802         1067
##  9 " Assoc-voc"                      361                  1021         1382
## 10 " Bachelors"                     2221                  3133         5354
## 11 " Doctorate"                      306                   107          413
## 12 " HS-grad"                       1675                  8826        10501
## 13 " Masters"                        959                   764         1723
## 14 " Preschool"                        0                    51           51
## 15 " Prof-school"                    423                   153          576
## 16 " Some-college"                  1387                  5904         7291
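
The education labels in the table above still carry the leading space, so the same problem affects the other text columns too. The following is a minimal base R sketch that trims every factor or character column in one pass; the data frame `sample_rows` is made up for illustration.

```r
# Hypothetical sample with the same leading-space problem as the adult data
sample_rows <- data.frame(
  age = c(50L, 38L),
  education = factor(c(" Bachelors", " HS-grad")),
  predictive_variable = c(" <=50K", " <=50K"),
  stringsAsFactors = FALSE
)

# Find every factor or character column
is_text <- vapply(
  sample_rows,
  function(column) is.factor(column) || is.character(column),
  logical(1)
)

# Trim each of those columns and store the result as a factor
sample_rows[is_text] <- lapply(
  sample_rows[is_text],
  function(column) factor(trimws(as.character(column)))
)

levels(sample_rows$education)
```

Numeric columns are left untouched because only the columns flagged in `is_text` are rewritten.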
education_data <- distinct(data %>% select(education, education_num)) %>% arrange(education_num)
education_data
##        education education_num
## 1      Preschool             1
## 2        1st-4th             2
## 3        5th-6th             3
## 4        7th-8th             4
## 5            9th             5
## 6           10th             6
## 7           11th             7
## 8           12th             8
## 9        HS-grad             9
## 10  Some-college            10
## 11     Assoc-voc            11
## 12    Assoc-acdm            12
## 13     Bachelors            13
## 14       Masters            14
## 15   Prof-school            15
## 16     Doctorate            16
data$is_rich <- if_else(data$predictive_variable == "<=50K", 0, 1)
education_summary <- data %>%
    group_by(education, education_num) %>%
    summarise(percentage_of_rich_people = sum(is_rich) / n() * 100) %>%
    arrange(education_num)
education_summary
## # A tibble: 16 x 3
## # Groups:   education [16]
##    education       education_num percentage_of_rich_people
##    <fct>                   <int>                     <dbl>
##  1 " Preschool"                1                      0   
##  2 " 1st-4th"                  2                      3.57
##  3 " 5th-6th"                  3                      4.80
##  4 " 7th-8th"                  4                      6.19
##  5 " 9th"                      5                      5.25
##  6 " 10th"                     6                      6.65
##  7 " 11th"                     7                      5.11
##  8 " 12th"                     8                      7.62
##  9 " HS-grad"                  9                     16.0 
## 10 " Some-college"            10                     19.0 
## 11 " Assoc-voc"               11                     26.1 
## 12 " Assoc-acdm"              12                     24.8 
## 13 " Bachelors"               13                     41.5 
## 14 " Masters"                 14                     55.7 
## 15 " Prof-school"             15                     73.4 
## 16 " Doctorate"               16                     74.1
ggplotly(ggplot(education_summary, aes(x = reorder(education, education_num), y = percentage_of_rich_people)) + geom_bar(stat = "identity") + coord_flip())
ggplotly(ggplot(education_summary, aes(x = education_num, y = percentage_of_rich_people, color = education)) + geom_point())
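
A plot suggests the trend; a rank correlation is one simple way to put a number on it. The following is a sketch using the percentages copied from the education_summary output above, with Spearman's rank correlation as the chosen measure. It describes how consistently the percentage rises with education_num and is not a formal proof of the hypothesis.

```r
# Values copied from the education_summary table above
education_num <- 1:16
percentage_of_rich_people <- c(0, 3.57, 4.80, 6.19, 5.25, 6.65, 5.11, 7.62,
                               16.0, 19.0, 26.1, 24.8, 41.5, 55.7, 73.4, 74.1)

# Spearman's rank correlation: a value close to +1 means the share of rich
# people rises almost steadily with the education level
rho <- cor(education_num, percentage_of_rich_people, method = "spearman")
rho
```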

Hypothesis 2

Let’s just analyze which type of occupation makes more money

occupation_summary <- data %>%
    group_by(occupation) %>%
    summarise(percentage_of_rich_people = sum(is_rich) / n() * 100) %>%
    arrange(percentage_of_rich_people)
occupation_summary
## # A tibble: 15 x 2
##    occupation           percentage_of_rich_people
##    <fct>                                    <dbl>
##  1 " Priv-house-serv"                       0.671
##  2 " Other-service"                         4.16 
##  3 " Handlers-cleaners"                     6.28 
##  4 " ?"                                    10.4  
##  5 " Armed-Forces"                         11.1  
##  6 " Farming-fishing"                      11.6  
##  7 " Machine-op-inspct"                    12.5  
##  8 " Adm-clerical"                         13.5  
##  9 " Transport-moving"                     20.0  
## 10 " Craft-repair"                         22.7  
## 11 " Sales"                                26.9  
## 12 " Tech-support"                         30.5  
## 13 " Protective-serv"                      32.5  
## 14 " Prof-specialty"                       44.9  
## 15 " Exec-managerial"                      48.4
ggplotly(ggplot(occupation_summary, aes(x = reorder(occupation, percentage_of_rich_people), y = percentage_of_rich_people)) + geom_bar(stat = "identity") + coord_flip())

Hypothesis 3

People who work more make more money

ggplot(data, aes(x = hours_per_week, fill = predictive_variable)) + geom_histogram(binwidth = 10)

Hypothesis 4

Men make more money than women

ggplot(data, aes(x = gender, fill = predictive_variable)) + geom_bar(stat = "count")

# gender_summary <- data %>%
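
A stacked bar chart shows counts, but the two groups differ in size, so percentages within each group are easier to compare. The following is a minimal base R sketch on made-up rows; the `toy` data frame is hypothetical and would be replaced by the gender and predictive_variable columns of the real data.

```r
# Hypothetical toy data standing in for data$gender and data$predictive_variable
toy <- data.frame(
  gender = c("Male", "Male", "Male", "Male",
             "Female", "Female", "Female", "Female"),
  predictive_variable = c(">50K", "<=50K", ">50K", "<=50K",
                          "<=50K", "<=50K", ">50K", "<=50K")
)

# Cross-tabulate, then convert each row (gender) to percentages
counts <- table(toy$gender, toy$predictive_variable)
percentages <- prop.table(counts, margin = 1) * 100
percentages
```

margin = 1 makes each row sum to 100, so the ">50K" column can be read as the percentage of rich people within each gender.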

Hypothesis 5

Hypothesis 6

Predicting the salary

There are many ways to predict a variable, depending on its data type.
If the variable you need to predict is a number, you need to use regression.
If the variable you need to predict is categorical, you need to use classification.
Some famous regression algorithms are:

  • Linear Regression
  • Polynomial Regression
  • Stepwise Regression

Some famous classification algorithms are:

  • Logistic Regression (despite its name, it predicts a category)
  • Naive Bayes Classifier
  • Nearest Neighbor
  • Decision Trees
  • Random Forest
  • Neural Networks

Every algorithm generates a model/formula. Creating the model is often referred to as training.
You can store the models in a variable and use them later on to predict.
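
The train, store and predict cycle can be sketched with base R alone. The example below is an illustration on the built-in mtcars data, not part of the salary analysis: lm() fits a regression model, glm() with family = binomial fits a logistic regression used as a classifier, and the weights in `new_cars` are made up.

```r
# Regression: predict a number (miles per gallon) from car weight
regression_model <- lm(mpg ~ wt, data = mtcars)

# Classification: predict a category (am: 0 = automatic, 1 = manual) from car weight
classification_model <- glm(am ~ wt, data = mtcars, family = binomial)

# The stored models can be reused later on unseen rows
new_cars <- data.frame(wt = c(2.0, 4.0))   # hypothetical weights, in 1000 lbs

predicted_mpg <- predict(regression_model, new_cars)
probability_manual <- predict(classification_model, new_cars, type = "response")
predicted_am <- ifelse(probability_manual > 0.5, "manual", "automatic")

predicted_mpg
predicted_am
```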

Using Naive Bayes Classifier

# library(e1071)
# naive_bayes_model <- naiveBayes(predictive_variable ~ ., data = data)
# predicted_values_from_naive_bayes_model <- predict(naive_bayes_model, data)
# tab <- table(predicted_values_from_naive_bayes_model,data$predictive_variable)
# print(tab)
# 1 - sum(diag(tab)) / sum(tab)

Using Decision Tree

# library(party)
# decision_tree_model <- ctree(predictive_variable ~ ., data = data)
# predicted_values_from_decision_tree_model <- predict(decision_tree_model, data)
# tab <- table(predicted_values_from_decision_tree_model,data$predictive_variable)
# print(tab)
# 1 - sum(diag(tab)) / sum(tab)

Using Random Forest

# library(randomForest)
# random_forest_model <- randomForest(predictive_variable ~ ., data = data)
# plot(random_forest_model)
# predicted_values_from_random_forest_model <- predict(random_forest_model, data)
# tab <- table(predicted_values_from_random_forest_model,data$predictive_variable)
# print(tab)
# 1 - sum(diag(tab)) / sum(tab)
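
The blocks above measure the error on the same rows the model was trained on, which tends to look better than the performance on unseen data. A common alternative is to hold out part of the data for testing. The following is a sketch under stated assumptions: it uses the built-in iris data so that it runs on its own, rpart (a recommended package shipped with R) in place of the tree from party, and an arbitrary 70/30 split with a fixed seed.

```r
library(rpart)   # decision trees; a recommended package shipped with R

set.seed(42)     # make the random split reproducible
train_rows <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train_set <- iris[train_rows, ]
test_set <- iris[-train_rows, ]

# Train on one part of the data
tree_model <- rpart(Species ~ ., data = train_set, method = "class")

# Evaluate on rows the model has never seen
predicted <- predict(tree_model, test_set, type = "class")
tab <- table(predicted, actual = test_set$Species)
print(tab)

test_error <- 1 - sum(diag(tab)) / sum(tab)
test_error
```

The same pattern should carry over to the other models: fit on `train_set`, call predict() on `test_set`, and compute the error from the resulting table.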

Publishing the insights


R Markdown allows you to create reproducible results that can be shared in HTML, PDF, PowerPoint or Word format.
Shiny allows you to create interactive web applications.

Exercise problem


Analyze the iris dataset and come up with interesting insights about the data. Also predict the Species of these data points:

newData <- data.frame(
    Sepal.Length = c(5.7, 6.3, 7.2),
    Sepal.Width = c(4.4, 2.9, 3.1),
    Petal.Length = c(1.4, 4.0, 5.1),
    Petal.Width = c(0.2, 1.0, 2.3),
    Species = ""
)
newData
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.7         4.4          1.4         0.2        
## 2          6.3         2.9          4.0         1.0        
## 3          7.2         3.1          5.1         2.3

Numeric Data

Sepal.Length
ggplot(iris, aes(x = Sepal.Length)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Sepal.Width
ggplot(iris, aes(x = Sepal.Width)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Petal.Length
ggplot(iris, aes(x = Petal.Length)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Petal.Width
ggplot(iris, aes(x = Petal.Width)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Categorical Data

Species
ggplotly(ggplot(iris %>% group_by(Species) %>% summarise(count = n()), aes(x = reorder(Species, count), y = count)) + geom_col() + coord_flip())

Using only one variable to determine the type of species
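
One possible way to try this is a simple threshold rule on a single column. The following is a rough sketch using Petal.Width; the cut points 0.8 and 1.75 are assumptions picked for illustration, not tuned values.

```r
# Thresholds chosen for illustration (an assumption, not a tuned model)
predicted <- cut(
  iris$Petal.Width,
  breaks = c(-Inf, 0.8, 1.75, Inf),
  labels = c("setosa", "versicolor", "virginica"),
  right = FALSE   # intervals are [lower, upper)
)

# Compare the rule with the true species
tab <- table(predicted, actual = iris$Species)
print(tab)

accuracy <- sum(diag(tab)) / sum(tab)
accuracy
```

Trying the same rule with the other three columns shows which single variable separates the species best.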

Trying to understand the relationships between two variables

Sepal length and Sepal width
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + geom_point()

Sepal length and Petal length
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) + geom_point()

Sepal length and Petal width
ggplot(iris, aes(x = Sepal.Length, y = Petal.Width, color = Species)) + geom_point()

Sepal width and Petal length
ggplot(iris, aes(x = Sepal.Width, y = Petal.Length, color = Species)) + geom_point()

Sepal width and Petal width
ggplot(iris, aes(x = Sepal.Width, y = Petal.Width, color = Species)) + geom_point()

Petal length and Petal width
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) + geom_point()
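
The scatter plots can be complemented with a correlation matrix, which summarises every pair of numeric columns in one table. The following is a minimal base R sketch; note that it pools all three species together, so it can hide relationships that only show up within a species.

```r
numeric_columns <- iris[, c("Sepal.Length", "Sepal.Width",
                            "Petal.Length", "Petal.Width")]

# Pearson correlation between every pair of columns
cor_matrix <- cor(numeric_columns)
round(cor_matrix, 2)
```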

Predict the Species of the new data points

actual_species <- c("setosa", "versicolor", "virginica")
newData
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.7         4.4          1.4         0.2        
## 2          6.3         2.9          4.0         1.0        
## 3          7.2         3.1          5.1         2.3

Using Naive Bayes Classifier

# library(e1071)
# naive_bayes_model <- naiveBayes(Species ~ ., data = iris)
# predicted_values_from_naive_bayes_model <- predict(naive_bayes_model, iris)
# tab <- table(predicted_values_from_naive_bayes_model,iris$Species)
# print(tab)
# 1 - sum(diag(tab)) / sum(tab)
# # Predicting the species for newData
# answer <- predict(naive_bayes_model, newData)
# tab <- table(answer,actual_species)
# print(tab)
# 1 - sum(diag(tab)) / sum(tab)
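
As an alternative to the Naive Bayes block above, the same prediction could be sketched with a nearest neighbor classifier. This example assumes the class package (a recommended package shipped with R) and an arbitrary choice of k = 5; it repeats the definition of newData so that it runs on its own.

```r
library(class)   # k-nearest neighbours; a recommended package shipped with R

newData <- data.frame(
  Sepal.Length = c(5.7, 6.3, 7.2),
  Sepal.Width = c(4.4, 2.9, 3.1),
  Petal.Length = c(1.4, 4.0, 5.1),
  Petal.Width = c(0.2, 1.0, 2.3)
)
actual_species <- c("setosa", "versicolor", "virginica")

feature_columns <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Each new point gets the majority species among its 5 closest iris rows (k = 5 is an assumption)
predicted_species <- knn(
  train = iris[, feature_columns],
  test = newData[, feature_columns],
  cl = iris$Species,
  k = 5
)

# Compare the predictions with the known answers
tab <- table(predicted = as.character(predicted_species), actual = actual_species)
print(tab)
1 - sum(diag(tab)) / sum(tab)
```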